Failure Data-Driven Selective Node-Level Duplication to Improve MTTF in High Performance Computing Systems

Authors

  • Nithin Nakka
  • Alok N. Choudhary
Abstract

This paper presents our analysis of the failure behavior of large-scale systems using the failure logs collected by Los Alamos National Laboratory on 22 of their computing clusters. We note that not all nodes in these systems show similar failure behavior. Our objective, therefore, was to arrive at an ordering of nodes to be incrementally (one by one) selected for duplication, so as to achieve a target MTTF for the system by duplicating the fewest nodes. We arrived at a model for the fault coverage provided by duplicating each node and ordered the nodes according to the coverage each provides. Compared with the traditional approach of randomly choosing nodes for duplication, our model-driven approach provides improvements ranging from 82% to 1700%, depending on the targeted improvement in MTTF and the failure distribution of the nodes in the system.
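The selection idea in the abstract can be illustrated with a minimal sketch. It is not the authors' coverage model, which the abstract does not spell out. It assumes that system MTTF is estimated as the observation window divided by the number of uncovered failures, and that duplicating a node masks all of that node's failures. The failure counts, node names, and the helpers `system_mttf` and `nodes_needed` are hypothetical.

```python
import random

def system_mttf(failures, duplicated, window_hours):
    """Estimate MTTF as the observation window over uncovered failures.

    Assumption: duplicating a node masks every failure of that node.
    """
    remaining = sum(f for node, f in failures.items() if node not in duplicated)
    return float("inf") if remaining == 0 else window_hours / remaining

def nodes_needed(failures, order, target_mttf, window_hours):
    """Duplicate nodes in the given order until the target MTTF is met."""
    duplicated = set()
    for node in order:
        if system_mttf(failures, duplicated, window_hours) >= target_mttf:
            break
        duplicated.add(node)
    return len(duplicated)

# Hypothetical failure counts per node over a 1000-hour window.
failures = {"n0": 40, "n1": 5, "n2": 3, "n3": 1, "n4": 1}
window_hours = 1000.0
baseline = system_mttf(failures, set(), window_hours)
target = 2 * baseline  # aim to double the baseline MTTF

# Ranked order: nodes with the most failures first.
ranked = sorted(failures, key=failures.get, reverse=True)

# Random order, seeded so the comparison is repeatable.
shuffled = list(failures)
random.Random(0).shuffle(shuffled)

ranked_count = nodes_needed(failures, ranked, target, window_hours)
random_count = nodes_needed(failures, shuffled, target, window_hours)
print(ranked_count, random_count)
```

With a skewed failure distribution such as this one, the ranked order reaches the target after duplicating the single most failure-prone node, while a random order needs at least as many duplications, and usually more.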


Related resources

Green Energy-aware task scheduling using the DVFS technique in Cloud Computing

Nowadays, energy consumption has become a critical issue in high-performance distributed computing systems, so green computing tries to reduce energy consumption, carbon footprint and CO2 emissions in high performance computing systems (HPCs) such as clusters, Grid and Cloud that a large number of parallel. Reducing energy consumption for high end computing can bring various benefits such as red...


Relative MTTF-Based Incentive Scheme for Availability-Based Replication in P2P Systems

When P2P systems are used for data-sensitive systems, data availability becomes an important issue. Availability-based replication using individual node availability is the most popular method for keeping data availability high efficiently. However, since the individual node availability is derived from the individual lifetime information of each node, the availability-based replication may...


Cooperative Orthogonal Space-Time-Frequency Block Codes over a MIMO-OFDM Frequency Selective Channel

In this paper, a cooperative algorithm to improve the orthogonal space-time-frequency block codes (OSTFBC) in frequency selective channels for 2*1, 2*2, 4*1, 4*2 MIMO-OFDM systems is presented. The algorithm involves three nodes, a source node, a relay node and a destination node, and is implemented in two stages. During the first stage, the destination and the relay antennas receive the sym...


E2DR: Energy Efficient Data Replication in Data Grid

Data grids are an important branch of grid computing which provide mechanisms for the management of large volumes of distributed data. Energy efficiency has recently emerged as a hot topic in large distributed systems. The development of computing systems is traditionally focused on performance improvements driven by the demand of client's applications in scientific and business domai...


The Case for Modular Redundancy in Large-scale High Performance Computing Systems

Recent investigations into the resilience of large-scale high-performance computing (HPC) systems showed a continuous trend of decreasing reliability and availability. Newly installed systems have a lower mean-time to failure (MTTF) and a higher mean-time to recover (MTTR) than their predecessors. Modular redundancy is being used in many mission critical systems today to provide for resilience, such...




Publication date: 2009